Assignment

Files Submitted

Table of Contents

  1. Introduction
  2. Business Understanding
  3. Data Understanding
  4. Data Exploration Analysis
  5. Storytelling with Tableau
  6. Final Data Preparation
  7. Decision Tree
  8. Random Forest
  9. Logistic Regression
  10. Limitations
  11. Conclusion
  12. References
  13. Learning Journal

Introduction

Employee attrition is critical to the wellbeing of a company. High employee turnover disrupts workflow and production and drains scarce resources such as time and labour. Furthermore, it costs on average $1,200 to recruit and train a new employee. It is in a company's best interest to maintain low employee turnover and to examine employees' motives to stay or leave.

CRISP-DM

image.png

Business Understanding

The CRISP-DM model was used to analyse this dataset. The first step of the CRISP-DM model is to establish a clear Business Understanding, which requires the researcher to understand business objectives and goals, as well as business successes and failures. It is also vital to understand what would make this analysis a success and what the success criteria are.

Data Preparation

The next step in the CRISP-DM model is Data Preparation, which means that the dataset has to be cleaned and treated, with each column of the data understood. It is vital to conduct data exploration during this stage by running descriptive statistics, such as the mean and standard deviation of numeric variables, and creating graphs.

Data Modelling

After data preparation, data modelling will be conducted. If the research being conducted is supervised, the researchers must know what their target (dependent) variable is prior to data mining. This allows the researchers to identify appropriate machine learning (ML) techniques for the dataset. Identifying the specific assumptions of each modelling technique is vital at this stage.

Evaluation

As the models are developed, they need to be evaluated. The researchers must move fluidly between the modelling and evaluation stages because the models can be improved upon once evaluated.


Deployment

Once the model is developed, it is important to organise and present the knowledge and insight gained in a manner that the customer can use. Regular maintenance has to be scheduled so that the model stays up to date with the new data being imported into it, ensuring its accuracy.

Business Understanding

The Great Resignation

Since the start of the pandemic, there has been a labour shortage, which has been coined the “Great Resignation” (Miel, 2021); even individuals in the executive ranks of companies have been changing positions (Walsh, 2021). Several factors impact the reasons for employees leaving, such as a lack of promotional opportunities and the inability to work from home.

The cost of a new hire was examined by Indeed in 2021, which stated that it costs on average $1,200, covering different types of training such as instructor-led training, online learning programmes, mentoring and hands-on learning.

Several hidden costs of training new employees have been identified:

For several years, low employee turnover was examined through the lens of the business and its stakeholders. Businesses aimed to reduce turnover because of its costs; however, the reasons for employee turnover were left unexamined and misunderstood. In the current workplace environment and ‘The Great Resignation’ of the post-Covid era, there is a shift in businesses' understanding of what motivates and retains great employees, as well as a focus on why great employees leave.

Business Objectives

There are several business objectives that are important to the business.

Business Success Criteria

The project will be deemed successful if the model reaches 80% accuracy, with the aim of reducing cases where the model predicts that an employee stays but in reality the employee leaves. The model's ability to predict the features of employees who stay can help to examine how they differ from employees who left, and perhaps to change the working environment for those leavers.


Inventory of Resources

IT Carlow Library has a vast number of books on machine learning algorithms, as well as excellent lecturers who can answer difficult questions about machine learning. No other individual will be working with me on this assignment because it is an individual assignment for this module. There will be no updates to the data during this project, so the data gathered at the start is the same data at the end of the project.

Requirements, Assumptions, and Constraints

Several assumptions were made during this project, such as that the data was verified during the data mining process and that the data is clean. There are no missing values for categorical variables, while a zero in numeric variables is treated as accurate.

There were several constraints on this project. The link to the data dictionary was not available, so it was difficult to understand what some variables were. Another issue was the size of the dataset: by most standards it was considered rather small, with only 1470 observations.

Risks and Contingencies

No risks that might delay this project were identified. However, like any project, there is a possibility that there are not enough data observations to lead to the best data mining processes, great insights and precise deployment.

Abbreviations

Terminology

There is some important terminology for this project which will be highlighted in this section.

image.png

This terminology is commonly used in the evaluation of a model via a confusion matrix, which will be generated for each model to evaluate it using recall, precision and accuracy.
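As a hedged sketch of how these metrics derive from a confusion matrix with scikit-learn (the labels below are illustrative values, not the project's data; 1 = the employee left):

```python
# Illustrative sketch: confusion matrix terms and the metrics built on them.
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score

# Hypothetical labels: 1 = employee left (positive class), 0 = stayed
y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 1, 0, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = accuracy_score(y_true, y_pred)    # (tp + tn) / total
precision = precision_score(y_true, y_pred)  # tp / (tp + fp)
recall = recall_score(y_true, y_pred)        # tp / (tp + fn)
```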

Important evaluation terminology:

Costs and Benefits

The benefit of this project is the identification of features that could help identify employees who are potentially likely to leave. This could lead to a (costly) intervention to retain those employees. In turn, it could lead to a better workplace environment and a better work culture.

Data Mining Goals

The data mining goal is to predict which employees are more likely to leave the company than to stay.

Three different models will be compared to assess which has the best ability to classify the minority class accurately: Decision Tree, Random Forest and Logistic Regression.

Reducing False Positives Rationale

Although reducing both false negatives and false positives is important, it was decided to focus on false positives: the model predicting that an employee would stay when in reality they left the company. In such cases the model fails to predict a turnover of staff, which is exactly the opposite of the desired result.

By focusing on reducing False Positives, it is very possible that False Negatives will rise. Although this may seem counter-intuitive, it would be highly useful for this project. Measures taken to maximise True Negatives and reduce False Positives, such as a reduction in overtime, salary raises and a hybrid working model, would result in fewer employees leaving the company. Individuals who were categorised as False Negatives would also benefit from these perks, which could lead to job satisfaction and environment satisfaction. Although this may be viewed as an expensive intervention, HR research and LinkedIn have stated that it is very important for a company to have employees' well-being in mind, and that it should even be in the company's core values. HR managers from competitors' companies continue to attract employees who are unhappy in their current jobs due to overwork, the lack of a hybrid working model and poor wages.

AUC and ROC

The AUC value (Area Under the Curve) and ROC curve will also be used to evaluate the performance and prediction functions for each model.

The Area Under the Curve (AUC) measures the ability of a classifier to distinguish between classes and is used as a summary of the ROC curve. When the AUC is between 0.5 and 1, there is a high chance that the classifier will be able to distinguish the positive class values from the negative class values, because the classifier detects more True Positives and True Negatives than False Negatives and False Positives.

The receiver operating characteristic (ROC) curve is a graph with the false positive rate on the X-axis and the true positive rate on the Y-axis. ROC curves are useful for visualising and comparing the performance of classifier methods.
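As a sketch, the ROC curve points and the AUC can be computed from predicted probabilities with scikit-learn (the scores below are illustrative values, not output from the project's models):

```python
# Sketch: ROC curve (FPR on the x-axis, TPR on the y-axis) and its AUC summary.
from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]  # hypothetical P(employee left)

fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)
```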

Recall and Precision Curve

The precision-recall curve plots precision against recall at different classification thresholds. It is particularly informative for imbalanced datasets such as this one, because it focuses on how well the positive (minority) class is predicted.
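As a sketch of how such a curve can be computed with scikit-learn (illustrative labels and scores, not the project's models; average precision summarises the curve):

```python
# Sketch: precision-recall curve points and the average-precision summary.
from sklearn.metrics import precision_recall_curve, average_precision_score

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]  # hypothetical P(employee left)

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
ap = average_precision_score(y_true, y_score)  # area-like summary of the curve
```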

Data Mining Success Criteria

A model will be deemed successful when its precision reaches 70%, which would mean that the False Positives have been successfully reduced.

Project Plan

A project plan was created to ensure a steady workload and completion by the deadline of 20th April 2022. The plan was devised in a manner that would allow plenty of revision in case a re-evaluation of techniques was decided upon.

When an extension of the project was granted, the plan was not revised because the devised plan was proceeding as intended.


image.png

Initial Assessment of Tools and Techniques

Python 3.9 was chosen to prepare and visualise the data, as well as to build and evaluate models; this was conducted in Jupyter Notebook. Tableau was chosen to create an in-depth visualisation.

Research Question

Several research questions were devised to explore data:

Data Understanding

The dataset consists of past and current employees in a spreadsheet. The dataset was downloaded from https://www.kaggle.com/patelprashant/employee-attrition

According to the Kaggle description, the dataset was previously available from IBM; however, it has since been taken down. https://www.ibm.com/communities/analytics/watson-analytics-blog/watson-analytics-use-case-for-hr-retaining-valuable-employees/

There are no missing values in our dataset.

Data Dictionary

Attribute DataType Description
Age int Age of an employee: range 18 to 60
BusinessTravel text Travel for work: Travel_Rarely, Travel_Frequently, Non-Travel
DailyRate int Daily rate of employee salary
Department text The department that the employee worked in: Sales, Research & Development, Human Resources
DistanceFromHome int Number of miles away from home: range 1 to 29
Education int The education level reached by the employee: 1 'Below College' 2 'College' 3 'Bachelor' 4 'Master' 5 'Doctor'
EducationField text The area of study: Life Sciences, Medical, Marketing, Technical Degree, Human Resources, Other
EmployeeCount int Unclear from data dictionary
EmployeeNumber int Employee number of the employee in the dataset
EnvironmentSatisfaction int 1 'Low', 2 'Medium', 3 'High', 4 'Very High'
Gender text Gender of employee: Male or Female
HourlyRate int Hourly Rate of Employee
JobInvolvement int 1 'Low', 2 'Medium', 3 'High', 4 'Very High'
JobLevel int 1 'Low', 2 'Medium', 3 'High', 4 'Very High'
JobRole text Employee's job title: Sales Executive, Research Scientist, Laboratory Technician, Manufacturing Director, Healthcare Representative, Manager, Sales Representative, Research Director, Human Resources
JobSatisfaction int 1 'Low', 2 'Medium', 3 'High', 4 'Very High'
MaritalStatus text Employee's marital status: Single, Married, Divorced
MonthlyIncome int Employee's monthly salary
MonthlyRate int Unclear from data dictionary
NumCompaniesWorked int Number of companies previously worked for
Over18 text Employee's Over-18 status: Y
OverTime text Employee's overtime status: Yes, No
PercentSalaryHike int Percent of Salary Hike
PerformanceRating int 1 'Low', 2 'Good', 3 'Excellent', 4 'Outstanding'
RelationshipSatisfaction int 1 'Low', 2 'Medium', 3 'High', 4 'Very High'
StandardHours int Unclear from data dictionary
StockOptionLevel int Unclear from data dictionary
TotalWorkingYears int Number of total working years
TrainingTimesLastYear int Number of training times last year
WorkLifeBalance int 1 'Bad', 2 'Good', 3 'Better', 4 'Best'
YearsAtCompany int Number of total working years in the company
YearsInCurrentRole int Number of total working years in the current role
YearsSinceLastPromotion int Number of total working years since last promotion
YearsWithCurrManager int Number of total working years with current manager

Please note: DistanceFromHome was assumed to be in miles because the data is from IBM, which is an American company. This was not provided in the data dictionary.

Data Dictionary: Target Variable
Attribute DataType Description
Attrition text Did the employee leave or not: Yes or No?

Data Preparation

Before the data was modelled, it was processed and prepared for modelling. This included changing relevant variables to categories, with the levels of categories reordered where required. After some deliberation, several variables were dropped.

Marital Status

The 'MaritalStatus' column was converted to a categorical variable and its categories were reordered; this becomes important in graphs.

Education

The numerical values in the 'Education' column were changed to the corresponding categorical values; the column was converted to a categorical variable and its categories were reordered, which becomes important in graphs.
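A minimal sketch of this step with pandas, assuming the label mapping from the data dictionary above (the sample rows are hypothetical):

```python
# Sketch: map numeric 'Education' codes to labels and set an explicit,
# ordered category order so graphs render the levels in a sensible sequence.
import pandas as pd

df = pd.DataFrame({'Education': [3, 1, 5, 2, 4]})  # hypothetical rows
labels = {1: 'Below College', 2: 'College', 3: 'Bachelor',
          4: 'Master', 5: 'Doctor'}
order = ['Below College', 'College', 'Bachelor', 'Master', 'Doctor']

df['Education'] = pd.Categorical(df['Education'].map(labels),
                                 categories=order, ordered=True)
```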

Redundant values

The following columns were identified as redundant due to the lack of variety in their values.

Dropping Redundant Columns

There are several columns that were unclear from the initial data dictionary or were too granular. The table below lists the dropped columns and the reasons for dropping them.

Attribute Reason
EmployeeCount No data description was provided and all values are '1'
Job Level No data description was provided
EmployeeNumber Redundant - in pandas index is used
Over18 All employees are/were over 18
StockOptionLevel No data description was provided
PerformanceRating Only two values: Excellent and Outstanding. Unclear what is the difference between the two
MonthlyRate Too granular
HourlyRate Too granular
DailyRate Too granular
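The drop itself can be sketched with pandas as follows (`df` is assumed to hold the loaded dataset; the stand-in frame below only carries the column names from the table above):

```python
# Sketch: drop the columns flagged as redundant or undocumented.
import pandas as pd

df = pd.DataFrame(columns=['Age', 'EmployeeCount', 'EmployeeNumber', 'Over18',
                           'StockOptionLevel', 'PerformanceRating',
                           'MonthlyRate', 'HourlyRate', 'DailyRate'])
to_drop = ['EmployeeCount', 'EmployeeNumber', 'Over18', 'StockOptionLevel',
           'PerformanceRating', 'MonthlyRate', 'HourlyRate', 'DailyRate']
df = df.drop(columns=to_drop)
```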

Target Variable: Attrition

The following breakdown of the 'Attrition' column was observed. In this dataset, the number of employees who stayed in the company is greater than the number who left.

Attrition # of Employees
Stayed (No) 1233
Left (Yes) 237

The dataset is unbalanced because the two classes (stayed or left) do not contain the same number of employees. This can make modelling difficult and inaccurate, which would require upsampling or downsampling.
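The imbalance can be checked with a one-liner; the sketch below rebuilds the counts from the table above rather than loading the real file:

```python
# Sketch: inspect the class balance of the target variable.
import pandas as pd

df = pd.DataFrame({'Attrition': ['No'] * 1233 + ['Yes'] * 237})
counts = df['Attrition'].value_counts()               # absolute counts
share = df['Attrition'].value_counts(normalize=True)  # proportion per class
```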

Data Exploration Analysis

Plotly's interactive graphs were used to visualise categorical variables, and seaborn was used to create pairplots, which examine correlations and distributions, and heatmaps.

A Tableau Dashboard was also used to visualise and build a story with the dataset, and it is embedded into this Jupyter Notebook.

Categorical variables

This section was used to examine the distribution of leavers and stayers across different categorical variables, rather than to give deeper insight.

Attrition

Insight

The leavers are classified as the minority class in this project due to their small number of observations, while the stayers are classified as the majority class due to their much higher number of observations.

Impact of Education on Attrition

Insight

The graph shows the distribution of employees by category, rather than giving deeper insight.

Impact of Education Field on Attrition

Insight

The graph shows the distribution of employees by category, rather than giving deeper insight.

Impact of Business Travel on Attrition

Insight

It seems that employees who do not require business travel leave the company the least.

Impact of Department on Attrition

Insight

The graph shows the distribution of employees by category, rather than giving deeper insight.

Impact of Gender on Attrition

Insight

Because more men are employed in the company than women, the proportions of men and women leaving are much the same. The graph shows the distribution of employees by category, rather than giving deeper insight.

Impact of over Time on Attrition

Insight

There is not a big difference between stayers and leavers in terms of working overtime. This suggests that working overtime is not the only factor that makes employees leave.

Impact of Job Role on Attrition

Insight

Several roles have higher employee turnover than others. For example, nearly half of all Sales Representatives have left their position. There is also a higher rate of turnover among Sales Executives, Research Scientists and Laboratory Technicians than in other positions.

Impact of Marital Status on Attrition

Insight

There seem to be more leavers who are single than those who are married or divorced. This is an interesting observation, which will have to be explored further.

Perhaps single people are aged 20-35 and on low wages, and leave to get a higher income. This will have to be explored further.

Impact of Work Life Balance on Attrition

Insight

This graph is interesting because 766 stayers state that they have a bad work-life balance, while only 127 such employees actually left the company.

Fifty-five stayers reported that they have the 'best' work-life balance, while 25 leavers reported the same, which is nearly half the number of stayers in that category. This is interesting because it slightly suggests that having the best work-life balance does not stop employees from leaving.

Reconstructing Work Life Balance

It was decided to reconstruct the categories in Work Life Balance by merging 'Better' and 'Best' into 'Good'. Semantically, there is not much of a difference between the three categories, and merging them helps to condense the categories.

Impact of Job Satisfaction on Attrition

Insight

The number of leavers seems to be consistent across the categories. A notable number of leavers who reported high job satisfaction still left the job.


Exploring Correlation

Correlations between pairs of features were examined by tabulating the correlations of the numerical variables, followed by a heatmap for easier visualisation.
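A minimal sketch of this step, using synthetic stand-in columns rather than the real dataset:

```python
# Sketch: tabulate pairwise Pearson correlations of numeric columns.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 3)),
                  columns=['YearsAtCompany', 'YearsInCurrentRole', 'Age'])

corr = df.corr()
# The heatmap can then be drawn with seaborn:
#   sns.heatmap(corr, annot=True, vmin=-1, vmax=1)
```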

Correlation observation

There were correlations between YearsAtCompany and YearsInCurrentRole (r = 0.76), YearsAtCompany and YearsSinceLastPromotion (r = 0.62), YearsAtCompany and YearsWithCurrManager (r = 0.77), YearsInCurrentRole and YearsSinceLastPromotion (r = 0.55), and YearsInCurrentRole and YearsWithCurrManager (r = 0.71).

Scatterplots and Histograms

A pairplot was used to visualise correlations through scatterplots and to examine distributions through histograms.

Skewness and Kurtosis

Skewness and kurtosis explore whether the data is normally distributed. Skewness examines whether the data is skewed to the left or to the right, making the distribution distorted or asymmetric, while kurtosis examines the peak of the distribution: is it too flat or too narrow?

Both are deviations from the normal distribution, which looks like a bell, hence the name: bell curve.
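Both statistics are available per column in pandas; the sketch below uses synthetic normal data (chosen so the values land near zero) rather than the project's dataset:

```python
# Sketch: per-column skewness and (excess) kurtosis with pandas.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({'Age': rng.normal(37, 9, size=1000),
                   'MonthlyIncome': rng.normal(6500, 4700, size=1000)})

skewness = df.skew()  # ~0 for a symmetric distribution
kurt = df.kurt()      # excess kurtosis, ~0 for a normal distribution
```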

Observation about Skewness and Kurtosis

George & Mallery (2010) consider measures between -2 and +2 to be appropriate; according to this criterion, all variables meet the threshold.

Hair et al. (2010) and Byrne (2010) consider measures of -2 to +2 for skewness and -7 to +7 for kurtosis to suggest a normal distribution of the variables; all variables meet this criterion too.

Variance Inflation Factor (VIF)

Due to the high correlation between some variables, it was decided to run VIF to analyse their multicollinearity. Multicollinearity can produce estimates of the regression coefficients that are not statistically significant. When two or more independent variables are highly correlated, it becomes difficult to state which variable is really influencing the dependent variable (Gil, Sousa and Verleysen, 2013).

EDA Conclusion

Based on the above exploration, the following columns were dropped: 'YearsWithCurrManager' and 'YearsInCurrentRole' due to the high correlation with other variables.

VIF showed a limited multicollinearity, which did not require further removal of variables.

Skewness and kurtosis also showed normally distributed data so that means that data does not require further treatment to become normally distributed.

Correlations and VIF were then rerun to ensure that the correlations and multicollinearity had been dealt with.

Story Telling with Tableau

Tableau Dashboard was used to visualise the interaction between features in the dataset and is embedded into this Jupyter Notebook.

Click on different bars to filter the data and hover over the bars or the scatterpoints for more information.

Tableau Dashboard Link

Observations from Tableau Dashboard

Final Data Preparation

After dropping irrelevant and highly correlated variables, 23 variables remained, with 1470 observations in the dataset.

Dummy Variables

Variables with categorical values, such as 'Marital Status' with its values 'Single', 'Married' and 'Divorced', were split into one column per category represented in the data and converted to integers.

If the employee was married, a value of 1 was assigned to MaritalStatus_Married, and 0 was assigned to the other two MaritalStatus columns (MaritalStatus_Single, MaritalStatus_Divorced). This increased the overall column count to 48, as our final_data variable shows.

Dummy Variable reduction

Feature Description Value
BusinessTravel_Travel_Frequently Does not travel frequently 0
BusinessTravel_Travel_Rarely Does not travel rarely 0
BusinessTravel_Non-Travel Does not travel for work 1

With dummy variables it is possible to drop one column: if 'BusinessTravel_Travel_Frequently' and 'BusinessTravel_Travel_Rarely' are both 0, then 'BusinessTravel_Non-Travel' must be 1. One category is left out, and the missing category is called the reference category. Using a reference category makes all interpretation relative to that category. It also avoids multicollinearity among the dummy variables.

In the end there are 2 variables instead of 3, which reduces the number of columns.

Feature Description Value
BusinessTravel_Travel_Frequently Does not travel frequently 0
BusinessTravel_Travel_Rarely Does not travel rarely 0

The same was done with the other categorical variables: 'Education', 'Department', 'JobRole' and 'MaritalStatus'.
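The encoding with a dropped reference category can be sketched with `pd.get_dummies` (hypothetical rows; `drop_first=True` leaves out the alphabetically first category, here 'Non-Travel', as the reference):

```python
# Sketch: one-hot encode BusinessTravel with a reference category dropped.
import pandas as pd

df = pd.DataFrame({'BusinessTravel': ['Travel_Rarely', 'Non-Travel',
                                      'Travel_Frequently']})
dummies = pd.get_dummies(df, columns=['BusinessTravel'], drop_first=True)
```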

Final Data Data Dictionary

Attribute DataType Description
Age int Age of an employee: range 18 to 60
DistanceFromHome int Number of miles away from home: range 1 to 29
JobLevel int Range: 1 to 5
MonthlyIncome int Monthly Income of the Employee: Ranges from 1009 to 19999
NumCompaniesWorked int Number of companies the employee had worked for: Ranges from 0 to 9
PercentSalaryHike int Percentage of Salary Hike
TotalWorkingYears int Total Working Years
TrainingTimesLastYear int The number of training times last year: Ranges from 0 to 6
YearsAtCompany int Number of years at the current company
YearsSinceLastPromotion int Number of years since last promotion: Ranges from 0 to 15
WorkLifeBalance int Bad = 1, Good = 2, Better = 3, Best = 4
EnvironmentSatisfaction int Low = 1, Medium = 2, High = 3, Very High = 4
JobInvolvement int Low = 1, Medium = 2, High = 3, Very High = 4
JobSatisfaction int Low = 1, Medium = 2, High = 3, Very High = 4
Gender Binomial Female = 0, Male = 1
OverTime Binomial No = 0, Yes = 1
RelationshipSatisfaction Binomial No = 0, Yes = 1
BusinessTravel_Non-Travel Binomial No = 0, Yes = 1
BusinessTravel_Travel_Frequently Binomial No = 0, Yes = 1
BusinessTravel_Travel_Rarely Binomial No = 0, Yes = 1
Department_Human Resources Binomial No = 0, Yes = 1
Department_Research & Development Binomial No = 0, Yes = 1
Department_Sales Binomial No = 0, Yes = 1
Education_Below College Binomial No = 0, Yes = 1
Education_College Binomial No = 0, Yes = 1
Education_Bachelor Binomial No = 0, Yes = 1
Education_Master Binomial No = 0, Yes = 1
Education_Doctor Binomial No = 0, Yes = 1
EducationField_Human Resources Binomial No = 0, Yes = 1
EducationField_Life Sciences Binomial No = 0, Yes = 1
EducationField_Marketing Binomial No = 0, Yes = 1
EducationField_Medical Binomial No = 0, Yes = 1
EducationField_Other Binomial No = 0, Yes = 1
EducationField_Technical Degree Binomial No = 0, Yes = 1
JobRole_Healthcare Representative Binomial No = 0, Yes = 1
JobRole_Human Resources Binomial No = 0, Yes = 1
JobRole_Laboratory Technician Binomial No = 0, Yes = 1
JobRole_Manager Binomial No = 0, Yes = 1
JobRole_Manufacturing Director Binomial No = 0, Yes = 1
JobRole_Research Director Binomial No = 0, Yes = 1
JobRole_Research Scientist Binomial No = 0, Yes = 1
JobRole_Sales Executive Binomial No = 0, Yes = 1
JobRole_Sales Representative Binomial No = 0, Yes = 1
MaritalStatus_Single Binomial No = 0, Yes = 1
MaritalStatus_Married Binomial No = 0, Yes = 1
MaritalStatus_Divorced Binomial No = 0, Yes = 1

Please note: DistanceFromHome was assumed to be in miles, this was not provided in the data dictionary.

Target Variable

Attribute DataType Description
Attrition Binomial Did the employee leave or not: Yes or No?

Normalizing dataset

There is a large range of values across variables such as Age (M = 36.92, SD = 9.13, Range = 18-60) and Monthly Income (M = 6,502.93, SD = 4,707.96, Range = 1,009-19,999).

df[['Age', 'MonthlyIncome']].describe()

image.png

Due to the large variation in the variables' means and standard deviations, the data was normalised so that each variable has a mean of 0 and a variance of 1. This is performed to ensure that equal importance is placed on all features.

If this were not performed, the machine learning model would assume that 'MonthlyIncome' was of more significance than other variables such as 'Age', simply because of its higher values.

Numeric Variables

Only numeric variables, such as Age and Monthly Income, were normalised.

Categorical Variables

Variables such as Gender, which were converted to 0 and 1, were excluded from normalisation because 0 is for Female and 1 is for Male (alphabetically assigned).

Variables created as dummy variables, such as MaritalStatus_Single, MaritalStatus_Married and MaritalStatus_Divorced, also had 0 and 1 values and were excluded from normalisation because 0 is for No and 1 is for Yes.
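The standardisation of the numeric columns can be sketched with scikit-learn's `StandardScaler` (synthetic values stand in for the real columns):

```python
# Sketch: scale numeric columns to mean 0 and variance 1.
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({'Age': [18, 30, 45, 60],
                   'MonthlyIncome': [1009, 5000, 12000, 19999]})
scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
```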

Training and Test Set

The dataset was split into training and test sets. The training set was 70% of the dataset, while the test set was 30% (X_train = 1029, X_test = 441). This is a typical split for data analysis.
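The split can be sketched as follows; the 70/30 sizes match the figures above, and `stratify=y` implements the stratified sampling described in the SMOTE section (X and y are synthetic stand-ins with the dataset's class counts):

```python
# Sketch: stratified 70/30 train/test split.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1470, 5))
y = np.array([0] * 1233 + [1] * 237)  # 0 = stayed, 1 = left

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)
```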

SMOTE

The training set data is unbalanced. The majority class is No, implying that the employee stayed in the company, and the minority class is Yes, implying that the employee left. This means that the data has to be balanced: if it were not, the model built would be good at predicting the majority class, but not the minority class.

It was decided to oversample, due to the small number of minority-class observations, to bring the number of Yes (employee left) samples up to the level of No (employee stayed) samples. A technique called SMOTE (Synthetic Minority Oversampling Technique) was used to balance the dataset. It creates artificial samples that are similar to existing ones and inserts them into the minority class. Balancing was conducted on the training data only, not the test set: the algorithm needs to learn as much as it can from the training set and is then applied to the test data, which, as in the real world, would almost always be imbalanced.

If undersampling had been undertaken instead, this would have led to a smaller number of majority-class samples, so less for the algorithm to learn from.

Stratified random sampling was used because it ensures the same ratio of majority to minority class as in the whole dataset. Due to the small size of the minority class, it is vital that the test set has the correct ratio, otherwise the models will overlearn the majority class.

The results of the oversampling show that each class is now balanced: No = 1029, Yes = 1029.

Model

The following supervised ML classification algorithms were chosen to predict the minority class:

Fine-Tuning Models

Features

During this project, the features below were selected because the models were built on those specific variables. When the features that are commented out were manually placed into the model, they were not shown to be significant. It was decided to drop those features, build initial models, and fine-tune from there.

Decision Tree

A Decision Tree creates a tree structure to model the relationships among the predictors and the predicted outcome (Lantz, 2015).

The 'entropy' criterion means using 'information gain' to choose splits. The maximum depth was limited to 5 levels because otherwise the model would overfit. The optimal depth is unknown, so 5 was chosen as an experiment.

Feature importance was also examined to establish the features that are most important in predicting which employees will leave. Variables with a value of 0 were not used to construct the decision tree, because the tree was limited to 5 levels.
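A sketch of this configuration (synthetic data and hypothetical column names stand in for the prepared training set):

```python
# Sketch: decision tree with the entropy criterion and max_depth=5,
# plus the per-feature importances used in the feature selection step.
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(500, 4)),
                 columns=['Age', 'OverTime', 'MonthlyIncome', 'DistanceFromHome'])
y = (X['Age'] + rng.normal(scale=0.5, size=500) > 0).astype(int)

tree = DecisionTreeClassifier(criterion='entropy', max_depth=5, random_state=42)
tree.fit(X, y)
importances = pd.Series(tree.feature_importances_, index=X.columns)
```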

Evaluation of Decision Tree Model

After running the initial Decision Tree model, the model's accuracy was 71%.

Precision is 26%, which means that the accuracy of predicting a True Negative is less than chance, or the flip of an unbiased coin. This means that our initial DT model is not good at predicting True Negatives, and the fine-tuned model needs to focus on reducing False Positives (reducing the prediction of stayers instead of leavers).

The Area Under the Curve (AUC) is 64%, which is an average marker for predicting True Positives (employees who stayed); however, the recall and precision curve is not good at predicting True Negatives and reducing False Positives.

The Recall-Precision curve shows a very disappointing outcome for recall.

Important Features

Best Features

The following features were identified as significant from the initial Decision Tree.

These features were used to identify the best parameters for the Decision Tree.

To identify the best number of features, the variables that scored the lowest were dropped, and the model was rerun and re-evaluated. The table below summarises the number of variables remaining and the accuracy and other metrics for each model. It was decided that 14 variables were best suited, because both accuracy and precision (reducing the number of false positives) were highest.

image.png

Best Parameters for Decision Tree Model

After running the 10-fold CV grid search, these are the parameters that best suit the Decision Tree. Of course, if the code is rerun, it is possible that the search will suggest other parameters.

The Decision Tree was fine-tuned based on this screenshot.
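The search itself can be sketched as follows; the parameter grid here is a small hypothetical one, and the data is synthetic, but the 10-fold cross-validation and precision scoring match the approach described above:

```python
# Sketch: 10-fold grid search over decision tree parameters, scored on precision.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] > 0).astype(int)

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid={'criterion': ['gini', 'entropy'], 'max_depth': [3, 5, 7]},
    scoring='precision', cv=10)
grid.fit(X, y)
best = grid.best_params_
```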

Fine-Tuned Decision Tree Important Features

Cross Validation of Decision Tree

Evaluation of Cross-validation of Decision Tree

Cross-validation is a great way to evaluate the accuracy of a decision tree model. The cross-validation was set to score precision because our aim is to reduce False Positives. The mean cross-validation score of the decision tree is 73% (SD ± 3%), while the accuracy of the fine-tuned decision tree model without cross-validation is 77%.

Although the score under cross-validation was lower, cross-validation provides a better estimate of performance because it is based on unseen data.
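The cross-validation scored on precision can be sketched as follows (synthetic data; the mean and standard deviation correspond to the figures reported above):

```python
# Sketch: 10-fold cross-validation of the decision tree, scored on precision.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] > 0).astype(int)

scores = cross_val_score(DecisionTreeClassifier(max_depth=5, random_state=42),
                         X, y, cv=10, scoring='precision')
mean_precision, sd_precision = scores.mean(), scores.std()
```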

Evaluation of Fine-Tuned Decision Tree

After dropping different variables, the optimal number was found to be 14 variables, because of the higher accuracy and a slightly higher precision, which is unfortunately still worse than chance. image.png

Significant Features

These 14 variables were seen as the best predictors:

The significant features were MaritalStatus_Divorced, age, whether the employee worked overtime, whether they studied medicine, distance from home and the number of years since the last promotion.

Accuracy

Even though several features were dropped and the best parameters for the Decision Tree were found, the accuracy of the fine-tuned model improved only slightly, to 77%.

Precision

Although the precision increased, it remained below the chance level of a flip of an unbiased coin, which means that the fine-tuned decision tree still fails to reduce False Positives (the employees who left but were predicted to stay).

AUC Curves

The AUC is 72%, which means the model is adequate at predicting True Positives (employees who stayed). Although the fine-tuned DT Precision-Recall curve is slightly better than the initial DT Precision-Recall curve, it is still below the baseline. The fine-tuned DT failed to adequately predict True Negatives (employees who truly left) and to reduce False Positives (employees predicted to have stayed who really left).
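The ROC AUC and Precision-Recall figures discussed here are typically produced with scikit-learn's metric functions; a minimal sketch, with a stand-in model and data in place of the report's fine-tuned tree:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score, precision_recall_curve, auc

# Illustrative imbalanced data (~16% leavers).
X, y = make_classification(n_samples=600, weights=[0.84, 0.16], random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=2)

tree = DecisionTreeClassifier(max_depth=5, random_state=2).fit(X_tr, y_tr)
proba = tree.predict_proba(X_te)[:, 1]   # scores for the positive class

roc_auc = roc_auc_score(y_te, proba)
prec, rec, _ = precision_recall_curve(y_te, proba)
pr_auc = auc(rec, prec)                  # area under the PR curve
print(f"ROC AUC {roc_auc:.2f}, PR AUC {pr_auc:.2f}")
```

Note that for a PR curve the no-skill baseline is the positive-class prevalence, so with an imbalanced target the PR area is typically much lower than the ROC AUC.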

Conclusion of Decision Tree

Accuracy

Initial Decision Tree Model accuracy was 71%. After dropping features and the grid search for the best parameters, it was improved to 77%.

Precision

In the initial model the precision was 26%, improving to 37%, which is still very inaccurate at predicting True Negatives and reducing False Positives.

Curve

The initial Area Under the Curve (AUC) was 64%, and the fine-tuned model reached 72%, which is a very positive improvement! Unfortunately, the Precision-Recall curve failed to improve even when the model was fine-tuned.

Random Forest

Evaluation of Random Forest

The accuracy of the model is 84%; however, precision is 50%. This means that the model predicts the majority class (employees who have stayed) well, but it is appalling at predicting the minority class (those who have left).

Although the ROC AUC is 72%, the Precision-Recall curve is still below the baseline, meaning that the RF model still overestimates the number of people who stayed.

Important Features

Best Features

The following features were identified as significant from the Random Forest.

These features were used to identify the best parameters for the Random Forest.



To identify the best number of features, the lowest-scoring variables were dropped and the model was rerun and re-evaluated. The table below summarises the number of variables remaining and the accuracy and other metrics for each model. Twenty variables were chosen because both accuracy and precision (reducing the number of false positives) were the highest.

image.png

Best Parameters for Random Forest

After running the 10-fold CV grid search, these are the parameters that best suit the Random Forest. Of course, if the code is rerun, the model may suggest other parameters.

The Random Forest was fine-tuned based on this screenshot.

image.png
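The equivalent grid search for the Random Forest can be sketched as below; the grid is an illustrative assumption, since the actual grid and winning values are in the screenshot above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative imbalanced data in place of the attrition training set.
X, y = make_classification(n_samples=400, weights=[0.84, 0.16], random_state=3)

param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [5, None],
    "max_features": ["sqrt", "log2"],
}
# 10-fold CV scored on precision, matching the DT search setup.
search = GridSearchCV(RandomForestClassifier(random_state=3),
                      param_grid, cv=10, scoring="precision", n_jobs=-1)
search.fit(X, y)
print(search.best_params_)
```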

Fine-Tuned Random Forest

Fine-Tuned Random Forest Important Features

Evaluation of Fine-Tuned Random Forest

After dropping different variables, the optimal number of variables was found to be 20 because of the high accuracy; unfortunately, precision is still much worse than chance and, once again, we failed to meet our target of 70%.

image.png

Significant Features

These 20 variables were seen as the best predictors:

Some features that were significant in the DT were also significant in the RF: Marital Status_Divorced, age, whether the employee worked overtime, whether they studied medicine, distance from home and the number of years since their last promotion.

Accuracy

Even though several features were dropped and the best parameters for the Random Forest were found, the accuracy of the fine-tuned model remained at 83%.

Precision

The precision (47%) remained below the chance level of a flip of an unbiased coin, which means that the fine-tuned random forest still fails to reduce False Positives (the employees who left but were predicted to stay).

AUC Curves

The AUC is 72%, which means the model is adequate at predicting True Positives (employees who stayed). Although the fine-tuned RF Precision-Recall curve is slightly better than the initial RF Precision-Recall curve, it is still below the baseline. The fine-tuned RF failed to adequately predict True Negatives (employees who truly left) and to reduce False Positives (employees predicted to have stayed who really left).

Cross-Validation of Random Forest

Evaluation of Cross-validation of Random Forest

Cross-validation is a great way to evaluate the accuracy of a random forest model. The cross-validation was set to score precision because our aim is to reduce False Positives. The mean cross-validation score of the random forest is 92% (SD ± 2.5%), while the accuracy of the fine-tuned Random Forest model without cross-validation is 95%.

Although the cross-validated accuracy is lower, cross-validation provides a better estimate of accuracy because performance is measured on unseen data.

Conclusion of Random Forest

Accuracy

The initial Random Forest model accuracy was 84%. After dropping features and the grid search for the best parameters, it did not improve, remaining at 83%.

Precision

In the initial model the precision was 50%, dropping to 44%, highlighting its inaccuracy at predicting True Negatives and reducing False Positives.

Curve

Both the initial Area Under the Curve (AUC) and that of the fine-tuned model were 72%! Unfortunately, the Precision-Recall curve failed to improve even when the model was fine-tuned.

Overall

Random Forest was better at predicting True Positives than the Decision Tree. This is very common because a Random Forest builds several Decision Trees and aggregates their votes.

Logistic Regression

Evaluation of Logistic Regression

After running the initial Logistic Regression model, the model's accuracy was 78%.

Precision is 37%, which means that the accuracy of predicting a True Negative is less than chance, or a flip of an unbiased coin. This means that our initial Logistic Regression model is not good at predicting True Negatives, and the fine-tuned model needs to focus on reducing False Positives (predicting stayers instead of leavers).

The Area Under the Curve (AUC) is 72%, which is an average marker for predicting True Positives (employees who stayed); however, the Precision-Recall curve is not great at predicting True Negatives and reducing False Positives.

The Precision-Recall curve shows a very disappointing outcome for recall.
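The logistic regression evaluation described above can be sketched as follows; the data and model settings are illustrative placeholders, not the report's actual configuration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, roc_auc_score

# Illustrative imbalanced data (~16% leavers).
X, y = make_classification(n_samples=600, weights=[0.84, 0.16], random_state=4)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=4)

logreg = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
pred = logreg.predict(X_te)
proba = logreg.predict_proba(X_te)[:, 1]

print("accuracy ", accuracy_score(y_te, pred))
print("precision", precision_score(y_te, pred, zero_division=0))
print("ROC AUC  ", roc_auc_score(y_te, proba))
```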

Recursive feature elimination (RFE)
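Recursive feature elimination can be sketched with scikit-learn's `RFE` class; the logistic regression estimator and the target of 14 features below are illustrative assumptions, not the report's actual configuration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Illustrative data with 30 candidate features.
X, y = make_classification(n_samples=400, n_features=30, random_state=5)

# Recursively drop the weakest feature until 14 remain.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=14)
rfe.fit(X, y)
selected = [i for i, keep in enumerate(rfe.support_) if keep]
print(len(selected))   # indices of the surviving features
```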

Evaluation

We failed to reach some of our objectives. All three fine-tuned models had a high accuracy of over 75%, with RF having the highest accuracy at 83%, built with 20 variables. However, all models failed to reduce False Positives and overfit the majority class (stayers). This means that the models overlearned the majority class in the training set.

There is some relationship whereby job roles, salary and working overtime increase employee turnover. It is very difficult to draw a clear line because the models overlearned the majority class.

image.png

Limitations and Future Suggestions

There are several limitations to this study that could explain our accuracy and high number of False Positives.

Small Dataset

The best option is to collect more data. It is highly likely that since the data was collected, more people have opted to leave the company and more people have been hired. A bigger dataset would result in better modelling; the current dataset is small, with only 1,233 observations and 217 observations in the minority class (employees who left).

Due to the small dataset, the training and test sets were even smaller, even though upsampling with the SMOTE technique was used to artificially create more of the minority class in the training set. This led to our models overfitting and predicting the majority class exceptionally well; however, the aim of our research is to predict the minority class well. Only 16% of the workforce left, meaning that 84% stayed. So if the employer does nothing to improve the workplace environment, increase salaries or help employees with their work-life balance, only 16% of employees will leave and our model will continue to predict the majority class well.
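The report used SMOTE (via imbalanced-learn) to upsample the minority class. The core idea, interpolating between a minority sample and one of its minority-class neighbours, can be sketched in plain NumPy; this is a simplified illustration of the technique, not imbalanced-learn's implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(6)
X_min = rng.normal(size=(20, 4))   # stand-in minority-class samples

def smote_like(X_min, n_new, k=5, rng=rng):
    """Generate n_new synthetic points by interpolating between each
    chosen minority sample and a random one of its k nearest minority
    neighbours (the idea behind SMOTE)."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)            # idx[:, 0] is the point itself
    new = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        j = idx[i, rng.integers(1, k + 1)]   # random minority neighbour
        lam = rng.random()                   # interpolation factor in [0, 1)
        new.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(new)

synthetic = smote_like(X_min, n_new=30)
print(synthetic.shape)   # (30, 4)
```

In practice `imblearn.over_sampling.SMOTE` would be applied to the training split only, so that the test set stays untouched.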

Further Fine-Tuning the Model

There were several limitations to our models:

  1. There is a possibility of binning numeric variables such as Monthly Income by examining the distribution and perhaps creating new variables “Below Median Salary” and “Above Median Salary” to examine further whether monthly income has more of an influence on the data. Other numerical variables such as ‘YearsatCompany’ could be divided into ‘Less than 3 years’, ‘Between 3-5 years’, ‘Between 5-10 years’ and ‘Above 10 years’.
  2. Once new attributes were created, redundant columns were dropped and the data was cleaned, leaving 40 attributes in the dataset. Although correlations and VIF were run, no other method was used for further feature selection to minimise the attributes in the dataset. Features were dropped only because they were not seen as significant in any of the models run. Narrowing the number of variables even further could have made the models more accurate at predicting employees who left and reduced False Positives. Some models do not handle many variables well, so it would be highly recommended to further reduce the number of variables in the dataset.
  3. The SMOTE technique was used to upsample the minority class, and the stratify setting was used to preserve the dataset’s ratio of stayers to leavers. Exploring other methods of upsampling and stratifying would be highly recommended.
  4. Although the data was stratified when split into training and test sets, shuffling the data was not coded into the split. This would be best practice because it is possible that, during the split, the training set ends up with fewer employees in a given role (for example, laboratory technicians) than the test set.
  5. Although the models’ hyperparameters were fine-tuned with grid search, the decision tree and random forest models were not programmed to use weighted features.
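Item 1 in the list above suggests binning numeric columns; a sketch with pandas, using hypothetical values and the cut points proposed in the text for years at the company:

```python
import pandas as pd

# Hypothetical tenure values; the real column is 'YearsatCompany'.
df = pd.DataFrame({"YearsAtCompany": [1, 2, 4, 6, 8, 12, 20]})

# Bin into the tenure bands proposed above: (0,3], (3,5], (5,10], (10,inf].
df["TenureBand"] = pd.cut(
    df["YearsAtCompany"],
    bins=[0, 3, 5, 10, float("inf")],
    labels=["Less than 3 years", "3-5 years", "5-10 years", "Above 10 years"],
)
print(df["TenureBand"].tolist())
```

A median split for Monthly Income would work the same way with two bins at `df["MonthlyIncome"].median()`.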

Other techniques could have been used to fine-tune the Decision Tree and Random Forest models, such as AdaBoost classification, Gradient Boosting, Out-of-Bag scoring and Bootstrapping.

  1. Bootstrapping is when the multiple decision trees in a Random Forest are built using random sampling with replacement. This could be vital for our models to ensure that there are more samples, resulting in more minority cases and more accurate True Negatives (Kunchhal, 2020).
  2. Out-of-Bag scoring is related to bootstrapping: observations are chosen randomly with replacement, and the observations that do not appear in a given sample are known as out-of-bag points. This could be vital for our models to ensure more variety in the samples, which may lead to better detection of significant variables and more accurate True Negatives (predicted leavers) (Kunchhal, 2020).
  3. In AdaBoost classification, each predictor pays more attention to the instances wrongly predicted by the previous predictor. This is achieved by changing the weights of training instances, and a coefficient is assigned to each predictor depending on its training error. This would have been highly useful to us because of our target of reducing False Positives (Kumara, 2020).
  4. Gradient Boosting builds accuracy by correcting the previous model’s errors. Although it does not tweak the weights of training instances, each predictor is trained using its predecessor’s residual errors as labels; the ensemble consists of N trees (Kumara, 2020). This could be used in our case to improve True Negatives and reduce False Positives.
  5. Other evaluation metrics should have been used, such as Root Mean Square Error (RMSE), which is the standard deviation of the prediction errors, also known as residuals. The error is the distance between the data points and the regression line.
  6. No other fine-tuning was conducted on the logistic regression, which could be one of the biggest reasons why it underperformed.
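The boosting options listed above can be tried with scikit-learn's stock classes; the settings below are illustrative defaults on stand-in data, not tuned values for the attrition dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Illustrative imbalanced data (~16% leavers).
X, y = make_classification(n_samples=400, weights=[0.84, 0.16], random_state=7)

# Compare both boosters on precision, the metric the report targets.
for model in (AdaBoostClassifier(random_state=7),
              GradientBoostingClassifier(random_state=7)):
    scores = cross_val_score(model, X, y, cv=5, scoring="precision")
    print(type(model).__name__, round(scores.mean(), 2))
```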


It is highly recommended to address these needs and re-evaluate the models before deployment. Addressing the limitations above is common practice for data analysts (Xu, Zhang and Li, 2011) and would make our models more accurate at predicting employees who leave. If these shortcomings are not addressed, an inaccurate model may be deployed, giving incorrect metrics in attempts to reduce the high turnover rate and potentially damaging the reputation of the company.

Deployment

Given our results, we would highly recommend NOT deploying these models into the working environment. The next stage should be to revise and re-evaluate HR strategies, collect more data and examine some of the issues with the dataset.

Psychology of Working Environment

It is important to remember that employees' motives and issues are not captured in this dataset. Even though our data visualisation analysis showed that nearly half of laboratory technicians left the workplace, it did not highlight the reasons or the employees' motives. Over half of all laboratory technicians work overtime, and only 26% have left the workplace.

Our analysis can only lead us so far because our data might not capture all issues in the workplace. Perhaps a manager was highly controlling and created a toxic working environment that leads to employees working overtime. This might be true or false; however, it is not captured in the data, so there is no clear-cut reason why laboratory technicians leave. Interviews and focus groups will have to be conducted to examine the reasons why employees resign.

Our analysis highlighted a pattern and found a relationship. Unlike other problems, this one involves humans, with emotions, personal goals and motives that will have to be examined to understand the real reasons why they leave. Some models were built using the employees' marital status and found it significant. During the initial stages, when the marital status columns were removed, the algorithms performed worse. This could suggest some sort of relationship. Perhaps divorced employees wish to relocate or are having mental health issues due to the stress of the divorce, or married people wish to spend more time with their children. It is unclear why marital status has such an effect on the algorithms. It would be advised that HR examines its flexibility around working from home.

Exit Interviews/ Surveys

Conducting exit interviews with employees before they leave could be a great method of highlighting the reasons why an individual has decided to resign from their post. The answers will have to be analysed using thematic analysis, which is commonly used in psychology and viewed as a scientific method for analysing interviews. It would be highly recommended to use an external HR manager to conduct the interview so that the leaving employees are more willing to be honest about their reasons.

Meanwhile, surveys are conducted quantitatively, meaning that HR managers can conduct statistical analysis in Python or Tableau to represent the data and highlight common themes among the employees who leave.

Focus Group

HR managers should conduct focus groups twice a year to examine the common themes that occur in employee conversations. Once again, this should be conducted with an external HR manager so that the employees are more likely to be honest. Prior to COVID-19, office perks such as free fruit and nap pods were seen as caring gestures of the FAANG companies (Facebook, Amazon, Apple, Netflix and Google) (Cassidy, 2017; Shine Workplace Wellbeing, 2019).

Conclusion

References

Brownlee, J. (2019) ‘How to Calculate Precision, Recall, F1, and More for Deep Learning Models’, Machine Learning Mastery. Available at: https://machinelearningmastery.com/how-to-calculate-precision-recall-f1-and-more-for-deep-learning-models/ (Accessed: 13 April 2022).

Byrne, B. M. (2010) Multivariate Applications Series: Structural Equation Modelling with AMOS: Basic Concepts, Applications, and Programming. 2nd edn. United States of America: Taylor and Francis Group, LLC.

Cassidy, A. (2017) ‘Clocking off: the companies introducing nap time to the workplace’, The Guardian. Available at: https://www.theguardian.com/business-to-business/2017/dec/04/clocking-off-the-companies-introducing-nap-time-to-the-workplace (Accessed: 9 April 2022).

George, D. and Mallery, P. (2010) SPSS for Windows Step by Step: A Simple Guide and Reference. 10th edn. Boston: Pearson.

Hair, J. F., Black, W. C., Babin, B. J. and Anderson, R. E. (2010) Multivariate Data Analysis: Overview of Multivariate Methods. 7th edn. New Jersey: Pearson Education International.

Kumara, V. (2020) A Guide To Understanding AdaBoost, Paperspace Blog. Available at: https://blog.paperspace.com/adaboost-optimizer/.

Kunchhal, R. (2020) Out of Bag Score | OOB Score Random Forest Machine Learning. Available at: https://www.analyticsvidhya.com/blog/2020/12/out-of-bag-oob-score-in-the-random-forest-algorithm/ (Accessed: 29 April 2022).

Miel, R. (2021) ‘Welcome to “The Great Resignation”’, Plastics News, 11 October. Available at: https://search.ebscohost.com/login.aspx?direct=true&AuthType=ip,shib&db=edsbig&AN=edsbig.A678946036&site=eds-live&scope=site&custid=s4214462 (Accessed: 22 February 2022).

Precision-Recall (no date) scikit-learn. Available at: https://scikit-learn/stable/auto_examples/model_selection/plot_precision_recall.html (Accessed: 13 April 2022).

Shine Workplace Wellbeing (2019) ‘Free fruit at work - advantages of offering healthy snacks to staff’, Shine Workplace Wellbeing. Available at: https://www.shineworkplacewellbeing.com/free-fruit-at-work/ (Accessed: 9 April 2022).

sklearn.metrics.f1_score (no date) scikit-learn. Available at: https://scikit-learn/stable/modules/generated/sklearn.metrics.f1_score.html (Accessed: 13 April 2022).

Vorhies, W. (2016) CRISP DM Model [Photograph]. Available at: https://www.datasciencecentral.com/profiles/blogs/crisp-dm-a-standard-methodology-to-ensure-a-good-outcome (Accessed: 30 January 2022).

Walsh, D. (2021) ‘The Great Resignation’ hits the executive ranks, too. Available at: https://eds.s.ebscohost.com/eds/detail/detail?vid=5&sid=7331b78a-4e25-45f2-b256-ea9615b44488%40redis&bdata=JkF1dGhUeXBlPWlwLHNoaWImc2l0ZT1lZHMtbGl2ZSZzY29wZT1zaXRl#AN=edsbig.A682030477&db=edsbig (Accessed: 22 February 2022).

Xu, G., Zhang, Y. and Li, L. (2011) Web Mining and Social Networking. 1st edn. New York: Springer.

Appendices

Learning Journal

Week 1

24/01/2022 - 30/01/2022

Chose my dataset and examined business interests in conducting the research. Looked at the data structure, examined variables, and did quick visualisations of the data with pandas.

Week 2

31/01/2022 - 06/02/2022

Decided to use CRISP-DM as a framework to analyse the dataset. Read up on CRISP-DM and wrote part of the introduction.

Week 3

07/02/2022 - 13/02/2022

Conducted Business Understanding and CRISP-DM write up for the first part of analysis.

Week 4

14/02/2022 - 20/02/2022 Read up on Data Preparation for my dataset and thought about the data exploration techniques that I should use.

Conducted Data Preparation such as converting values to categorical variables or numeric.

Created plotly graphs to examine the distribution of numeric values.

Week 5

21/02/2022 - 27/02/2022 Watched tutorials about Tableau and read up on best practices for data storytelling. Created a Tableau Dashboard to visualise the data differently.

Week 6

28/02/2022 - 13/03/2022 Read up on feature selection using VIF and correlations. Wrote code for VIF and correlations, examined the results and concluded accordingly. Wrote up the results.


Week 8

14/03/2022 - 20/03/2022 Started exploring final data preparation such as dummy variables, SMOTE-ing and splitting the dataset into training and test sets.

Week 10

21/03/2022 - 27/03/2022 Read up about Decision Trees - best practices and pitfalls.
Created a Decision Tree, evaluated performance and fine-tuned the DT.

Week 11

28/03/2022 - 03/04/2022 Read up about Random Forests - best practices and pitfalls. Created a Random Forest, evaluated performance and fine-tuned the RF.

Week 12

04/04/2022 - 10/04/2022 Read up about Logistic Regression - best practices and pitfalls. Created a Logistic Regression, evaluated performance and fine-tuned the LR.

Week 13

11/04/2022 - 29/04/2022 Wrote up each algorithm and fixed the Business Understanding. Thought about limitations and future research. Final write-up and final fine-tuning of the models.